Synchronization arbiter for proactive synchronization within a multiprocessor computer system

ABSTRACT

A synchronization arbiter may be used in a computer system including one or more processors configured to request exclusive access to a given memory resource. The request may include one or more addresses associated with the memory resource. The synchronization arbiter includes an address storage that may store sets of addresses. Each address may correspond to a respective memory resource to which a requestor has acquired exclusive access. The address storage may further store count values, each associated with a respective set of addresses, and each may be indicative of a number of requesters contending for any address in the respective set of addresses. If any of the one or more addresses matches any address in the sets of addresses, control logic may return the count value associated with the matching address to the requestor.

This application claims the benefit of U.S. Provisional Application No. 60/710,548, filed on Aug. 23, 2005.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to microprocessors and, more particularly, to process synchronization among processors in a multiprocessor system.

2. Description of the Related Art

Modern microprocessor performance has increased steadily and somewhat dramatically over the past 10 years or so. To a large degree, the performance gains may be attributed to increased operating frequency and moreover, to a technique known as deep pipelining. Generally speaking, deep pipelining refers to using instruction pipelines with many stages, with each stage doing less, thereby enabling the overall pipeline to execute at a faster rate. This technique has served the industry well. However, there are drawbacks to increased frequency and deep pipelining. For example, clock skew and power consumption can be significant during high frequency operation. As such, the physical constraints imposed by system level thermal budget points, and the increased difficulty in managing clock skew may indicate that practical limits of the technique may be just around the corner. Thus, industry has sought to increase performance using other techniques. One type of technique to increase performance is the use of multiple core processors and more generally multiprocessing.

As computing systems employ multiprocessing schemes with more and more processors (e.g., processing cores), the number of requestors that may interfere or contend for the same memory datum may increase to such an extent that conventional methods of process synchronization may be inadequate. For example, when a low number of processors are contending for a resource, simple locking structures may provide adequate performance to critical sections of code. For example, locked arithmetic operations on memory locations may be sufficient. As the scale of multiprocessing grows, these primitives become less and less efficient. To that end, more advanced processors include additions to the instruction set that include hardware synchronization primitives (e.g., CMPXCHG, CMPXCHG8B, and CMPXCHG16B) that are based on atomically updating a single memory location. However, we are now entering the realm in which even these hardware primitives may not provide the kind of performance that may be demanded in high-performance, high processor count multiprocessors.

Many conventional processors use synchronization techniques based on an optimistic model. That is, when operating in a multiprocessor environment, these conventional processors are designed to operate under the assumption that they can achieve synchronization by repeatedly rerunning the synchronization code until no interference is detected, and then declare that synchronization has been achieved. This type of synchronization may incur an undesirable waste of time, particularly when many processors are attempting the same synchronizing event, since no more than one processor can make forward progress at any instant in time. As such, different synchronization techniques may be desirable.

SUMMARY

Various embodiments of a synchronization arbiter for proactive synchronization in a computer system are disclosed. In one embodiment, the synchronization arbiter may be used in a computer system including one or more processors each configured to request exclusive access to a given memory resource. The request may include one or more addresses associated with the given memory resource. The synchronization arbiter includes an address storage, a compare unit, and control logic. The address storage may store a plurality of sets of addresses, and each address may correspond to a respective memory resource to which a requestor has acquired exclusive access. In addition, the address storage may further store a plurality of count values each associated with a respective set of addresses. Each count value may be indicative of a number of requesters contending for any address in the respective set of addresses. The compare unit may compare each of the one or more addresses in the request to each address stored in the address storage. If any address of the one or more addresses matches any address in the sets of addresses, the control logic may return to the requestor the count value associated with the matching address.

In one specific implementation, the control logic may return a predetermined count value such as zero, for example, to the requestor in response to no address of the one or more addresses matching any address in the sets of addresses.

In another embodiment, the synchronization arbiter may be used in a computer system including one or more processors each configured to request exclusive access to a given memory resource. The request may include one or more addresses associated with the given memory resource. The synchronization arbiter includes an address storage, a compare unit, and control logic. The address storage may store a plurality of sets of addresses, and each address may correspond to a respective memory resource to which a requestor has acquired exclusive access. In addition, the address storage may further store a plurality of count values. Each count value may be associated with a respective address of each set of the plurality of sets of addresses. Further, each count value may be indicative of a number of requesters contending for any address in the respective set of addresses. The compare unit may compare each of the one or more addresses in the request to each address stored in the address storage. If any address of the one or more addresses matches any address in the sets of addresses, the control logic may return to the requester the count value associated with the matching address.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of one embodiment of a computer system.

FIG. 2 is a block diagram depicting further details of an embodiment of a processing node of FIG. 1.

FIG. 3 is a flow diagram that describes operation of one embodiment of the computer system shown in FIG. 1 and FIG. 2.

FIG. 4 is a flow diagram that describes operation of one embodiment of the computer system shown in FIG. 1 and FIG. 2 in response to receiving a coherency invalidation probe.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present invention as defined by the appended claims. It is noted that the word “may” is used throughout this application in a permissive sense (i.e., having the potential to, being able to), not a mandatory sense (i.e., must).

DETAILED DESCRIPTION

To enable the construction of high performance synchronization methods in software, a set of instructions, which may be referred to as an advanced synchronization facility, may be used. The facility may support the construction of non-Blocking synchronization, WaitFree synchronization, Transactional Memory, along with the construction of various forms of Compare and Swap primitives typically used in the construction of these methods. The facility allows construction (in software) of a large variety of synchronization primitives.

Moreover, the advanced synchronization facility may enable software to program a large variety of synchronization kinds. Each synchronization kind may directly specify: the cache lines needed for successful completion, a sequence point where failures can redirect control flow, a data modification section where the result of the successful critical section is performed, and a sequence point where success is made visible to the rest of the system making the whole sequence of instructions appear to be atomic.

Accordingly, the functionality of the advanced synchronization facility may enable the acquisition and release of multiple cache lines with write permission associated with a critical section substantially simultaneously as seen by other processors/cores. This process may be referred to as Linearizing. After acquisition, several modifications can be performed before any other interested party may observe any of the modifications to any of the specified multiple cache lines. Between the acquisition and the release, no other processors are allowed to be manipulating these same lines (e.g. have write permission). A similar method could have been performed by not sending HyperTransport™ Source Done messages for the associated lines and thereby preventing concurrent accesses. However, these solutions lead to deadlock and/or livelock, or timeouts. Thus, a computer system including processors and processor cores that may implement the advanced synchronization facility is described below.

Turning now to FIG. 1, an embodiment of a computer system 100 is shown. Computer system 100 includes several processing nodes 312A, 312B, 312C, and 312D. Each of processing nodes 312A-312D is coupled to a respective memory 314A-314D via a memory controller 316A-316D included within each respective processing node 312A-312D. Additionally, processing nodes 312A-312D include interface logic (IF) used to communicate between the processing nodes 312A-312D. For example, processing node 312A includes interface logic 318A for communicating with processing node 312B, interface logic 318B for communicating with processing node 312C, and a third interface logic 318C for communicating with yet another processing node (not shown). Similarly, processing node 312B includes interface logic 318D, 318E, and 318F; processing node 312C includes interface logic 318G, 318H, and 318I; and processing node 312D includes interface logic 318J, 318K, and 318L. Processing node 312D is coupled to communicate with a plurality of input/output devices (e.g. devices 320A-320B in a daisy chain configuration) via interface logic 318L. Other processing nodes may communicate with other I/O devices in a similar fashion. Processors may use this interface to access the memories associated with other processors in the system. It is noted that a component that includes a reference numeral followed by a letter may be generally referred to solely by the numeral where appropriate. For example, when referring generally to the processing nodes, processing node(s) 312 may be used.

Processing nodes 312 implement a packet-based link for inter-processing node communication. In the illustrated embodiment, the link is implemented as sets of unidirectional lines (e.g. lines 324A are used to transmit packets from processing node 312A to processing node 312B and lines 324B are used to transmit packets from processing node 312B to processing node 312A). Other sets of lines 324C-324H are used to transmit packets between other processing nodes as illustrated in FIG. 1. Generally, each set of lines 324 may include one or more data lines, one or more clock lines corresponding to the data lines, and one or more control lines indicating the type of packet being conveyed. The link may be operated in a cache coherent fashion for communication between processing nodes or in a non-coherent fashion for communication between a processing node and an I/O device (or a bus bridge to an I/O bus of conventional construction such as the PCI bus or ISA bus). Furthermore, the link may be operated in a non-coherent fashion using a daisy-chain structure between I/O devices as shown (e.g., 320A and 320B). It is noted that in an exemplary embodiment, the link may be implemented as a coherent HyperTransport™ link or a non-coherent HyperTransport™ link, although in other embodiments, other links are possible.

I/O devices 320A-320B may be any suitable I/O devices. For example, I/O devices 320A-320B may include devices for communicating with another computer system to which the devices may be coupled (e.g. network interface cards or modems). Furthermore, I/O devices 320A-320B may include video accelerators, audio cards, hard or floppy disk drives or drive controllers, SCSI (Small Computer Systems Interface) adapters and telephony cards, sound cards, and a variety of data acquisition cards such as GPIB or field bus interface cards. It is noted that the term “I/O device” and the term “peripheral device” are intended to be synonymous herein.

Memories 314A-314D may comprise any suitable memory devices. For example, a memory 314A-314D may comprise one or more RAMBUS DRAMs (RDRAMs), synchronous DRAMs (SDRAMs), DDR SDRAM, static RAM, etc. The memory address space of computer system 100 is divided among memories 314A-314D. Each processing node 312A-312D may include a memory map used to determine which addresses are mapped to which memories 314A-314D, and hence to which processing node 312A-312D a memory request for a particular address should be routed. Memory controllers 316A-316D may comprise control circuitry for interfacing to memories 314A-314D. Additionally, memory controllers 316A-316D may include request queues for queuing memory requests. Memories 314A-314D may store code executable by the processors to implement the functionality as described in the preceding sections.
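The memory map lookup described above can be illustrated with a minimal sketch. The region size, the equal four-way split, and the names `REGION_SIZE`, `MEMORY_MAP`, and `home_node` are assumptions made for illustration only; the description does not specify how the address space is divided.

```python
from bisect import bisect_right

# Hypothetical layout: the address space is split into four equal 1 GiB
# regions, one per memory 314A-314D (an assumption, not from the text).
REGION_SIZE = 1 << 30
MEMORY_MAP = [
    (0 * REGION_SIZE, "312A"),
    (1 * REGION_SIZE, "312B"),
    (2 * REGION_SIZE, "312C"),
    (3 * REGION_SIZE, "312D"),
]
BASES = [base for base, _ in MEMORY_MAP]


def home_node(addr):
    """Return the processing node whose memory holds physical address addr."""
    if not 0 <= addr < len(MEMORY_MAP) * REGION_SIZE:
        raise ValueError("address outside the mapped range")
    # The last region whose base is <= addr owns the address.
    return MEMORY_MAP[bisect_right(BASES, addr) - 1][1]
```

A request for an address in the second region would thus be routed to node 312B, possibly through an intermediate node.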

It is noted that a packet to be transmitted from one processing node to another may pass through one or more intermediate nodes. For example, a packet transmitted by processing node 312A to processing node 312D may pass through either processing node 312B or processing node 312C as shown in FIG. 1. Any suitable routing algorithm may be used. Other embodiments of computer system 100 may include more or fewer processing nodes than the embodiment shown in FIG. 1. Generally, the packets may be transmitted as one or more bit times on the lines 324 between nodes. A bit time may be the rising or falling edge of the clock signal on the corresponding clock lines. The packets may include command packets for initiating transactions, probe packets for maintaining cache coherency, and response packets for responding to probes and commands.

In one embodiment, processing nodes 312 may additionally include one or more processor cores (shown in FIG. 2). It is noted that the processor cores within each node may communicate via internal packet-based links operated in the cache coherent fashion. It is further noted that processor cores and processing nodes 312 may be configured to share any (or all) of the memories 314.

In one embodiment, one or more of the processor cores may implement the x86 architecture, although other architectures are possible and contemplated. As such, instruction decoder logic within each of the various processor cores may be configured to mark instructions that use a LOCK prefix. In addition, as described further below, processor core logic may include hardware (shown in FIG. 2) that may enable identification of the markers associated with LOCKed instructions. This hardware may enable the use of the LOCK instruction prefix to identify critical sections of code as part of the advanced synchronization facility.

To reduce the effects of interference caused by more than one processor attempting to access the same memory reference (e.g., critical sections of code) at the same time, the advanced synchronization facility and associated hardware may be implemented within computer system 100. As will be described in greater detail below, the advanced synchronization facility may employ new instructions and use hardware such as a synchronization arbiter (shown in FIG. 2) which may be interconnected within the cache coherent fabric. As shown in FIG. 2, synchronization arbiter 230 is coupled to a Northbridge unit 290 of any processing node 312, thus enabling the synchronization arbiter to observe explicit addresses associated with the Advanced Synchronization Facility transactions of each node. The synchronization arbiter may be placed anywhere in the coherent domain of the interconnect network. It is noted that although one synchronization arbiter is shown, it is contemplated that when a system is configured to support multiple virtual machines, and when these virtual machines do not share any actual physical memory, multiple synchronization arbiters can be configured to distribute the synchronization load across several arbiters.

It is noted that the phrase “critical section” is used throughout this document. A “critical section” refers to a section of code used in the advanced synchronization facility that may include one or more memory reference instructions marked with a LOCK prefix, an ACQUIRE instruction, and a RELEASE instruction which ends the critical section. In one embodiment, there are four stages of each critical section: 1) specifying the address(es) of the cache line(s) needed during the critical section (e.g., entering the critical section), 2) going through the mechanics to acquire these cache lines, 3) atomically modifying the critical section data, 4) releasing the cache lines back to the system. In particular, the critical section code will appear to be executed atomically by interested observers. The first phase may be referred to as the specification phase, while the third phase is often referred to as the atomic phase.

In various implementations, software may be allowed to perform ‘simple’ arithmetic and logical manipulations on the data between reading and modifying the data of the critical section as long as the simple arithmetic operations do not cause exceptions when executed. If a data manipulation causes an exception inside a critical section, atomicity of that critical section may not be guaranteed. Critical section software should detect failures of atomicity, and deal with them appropriately, as described further below.

Generally, the advanced synchronization facility may utilize a weakened memory model and operate only upon cacheable data. This weakened memory model may prevent the advanced synchronization facility from wasting processor cycles waiting for various processor and memory buffers to empty before performing a critical section. However, when software requires a standard PC strong memory model, software may insert LFENCE, SFENCE, or MFENCE instructions just prior to the RELEASE instruction to guarantee standard PC memory ordering. For the case of using cacheable synchronization to enable accesses to unCacheable data, an SFENCE instruction between the last LOCKed Store and the RELEASE instruction will guarantee that the unCacheable data is globally visible before the cacheable synchronization data is globally visible in any other processor. This may enable maximum overlap of unCacheable and Cacheable accesses with minimal performance degradation.

In various embodiments, interface logic 318A-318L may comprise a variety of buffers for receiving packets from the link and for buffering packets to be transmitted upon the link. Computer system 100 may employ any suitable flow control mechanism for transmitting packets. In addition to interface logic 318A-318L, each processing node may include a respective bus interface unit (BIU) 220 (shown in FIG. 2), which may provide functionality to enable proactive synchronization. For example, as described further below, BIU 220 may be configured to identify those special addresses that are associated with an Advanced Synchronization event and to transmit those addresses to synchronization arbiter 230 in response to execution of an ACQUIRE instruction. The BIU 220 may also be configured to determine if the response received from synchronization arbiter 230 indicates the addresses may be interference free. If the response indicates the addresses may not be interference free, BIU 220 may notify the requesting processor core of a failure by sending a failure count value to a register within the processor core 18, and sending a completion message to synchronization arbiter 230. If the addresses are guaranteed to be interference free, BIU 220 may allow the execution of the critical section and wait to send the completion message to synchronization arbiter 230.

FIG. 2 is a block diagram that illustrates more detailed aspects of embodiments of processing node 312A and synchronization arbiter 230 of FIG. 1. Referring to FIG. 2, processing node 312A includes processor cores 18A and 18n, where n may represent any number of processor cores. Since the processor cores may be substantially the same in various embodiments, only detailed aspects of processor core 18A are described below. As shown, processor cores 18A and 18n are coupled to bus interface unit 220 which is coupled to a Northbridge unit 290, which is coupled to memory controller 316A, HyperTransport™ interface logic 318A-318C, and to synchronization arbiter 230 via a pair of unidirectional links 324I-324J.

Processor core 18A includes hardware configured to execute instructions. More particularly, as is typical of many processors, processor core 18A includes one or more instruction execution pipelines including a number of pipeline stages, cache storage and control, and an address translation mechanism (only pertinent portions of which are shown for brevity). Accordingly, as shown processor core 18A includes a level one (L1) instruction cache, prefetch logic, and branch prediction logic. Since these blocks may be closely coupled with the instruction cache, they are shown together as block 250. Processor core 18A also includes an L1 data cache 207. Processor core 18A also includes instruction decoder 255 and an instruction dispatch and control unit 256, which may be coupled to receive instructions from instruction decoder 255 and to dispatch operations to a scheduler 259. Further, instruction dispatch and control unit 256 may be coupled to a microcode read-only memory (MROM) (not shown). Scheduler 259 is coupled to receive dispatched operations from instruction dispatch and control unit 256 and to issue operations to execution units 260. In various implementations, execution units 260 may include any number of integer execution units and floating-point units. Further, processor core 18A includes a TLB 206 and a load/store unit 270. It is noted that in alternative embodiments, an on-chip L2 cache may be present (although not shown).

Instruction decoder 255 may be configured to decode instructions into operations which may be either directly decoded or indirectly decoded using operations stored within the MROM. Instruction decoder 255 may decode certain instructions into operations executable within execution units 260. Simple instructions may correspond to a single operation, while in other embodiments, more complex instructions may correspond to multiple operations. In one embodiment, instruction decoder 255 may include multiple decoders (not shown) for simultaneous decoding of instructions. Each instruction may be aligned and decoded into a set of control values in multiple stages depending on whether the instructions are first routed to MROM. These control values may be routed in an instruction stream to instruction dispatch and control unit 256 along with operand address information and displacement or immediate data which may be included with the instruction. As described further below, when a memory reference instruction includes a LOCK prefix, instruction decoder 255 may identify the address with a marker.

Load/store unit 270 may be configured to provide an interface between execution units 260 and data cache 207. In one embodiment, load/store unit 270 may include load/store buffers with several storage locations for data and address information for pending loads or stores. As such, the illustrated embodiment includes LS1 205, linear LS2 209, physical LS2 210, and data storage 211. Further, processor core 18A includes marker logic 208, and a marker bit 213.

In one embodiment, a critical section may be processed in one of two ways: deterministically, and optimistically. The choice of execution may be based upon the configuration of the advanced synchronization facility and upon the state of a critical section predictor, as described in greater detail below. In various embodiments, either the basic input output system (BIOS), the operating system (OS), or a virtual memory manager (VMM) may configure the operational mode of the advanced synchronization facility. When operating in the deterministic execution mode, the addresses specified by the locked memory reference instructions may be bundled up and sent en masse to the synchronization arbiter 230 to be examined for interference. The cache line data may be obtained and the critical section executed (as permitted). In contrast, when operating in the optimistic synchronization mode, no interference may be assumed, and the critical section may be executed (bypassing the synchronization arbiter 230). If any other processor interferes with this critical section, the interference will be detected, and the processor then backs up to the ACQUIRE instruction and redirects control flow away from the atomic phase.

To implement the deterministic mode, the advanced synchronization facility may use the synchronization arbiter 230. As described above, synchronization arbiter 230 examines all of the physical addresses associated with a synchronization request and either passes (a.k.a. blesses) the set of addresses or fails (i.e., rejects) the set of addresses, based upon whether any other processor core or requestor is operating on or has requested those addresses while they are being operated on. As such, synchronization arbiter 230 may allow software to be constructed that proactively avoids interference. When interference is detected by synchronization arbiter 230, synchronization arbiter 230 may respond to a request with a failure status including a unique number (e.g., count value 233) to a requesting processor core. In one embodiment, the count may be indicative of the number of requesters contending for the memory resource(s) being requested. Software may use this number to proactively avoid interference in subsequent trips through the critical section by using this number to choose a different resource upon which to attempt a critical section access.
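One way software might use the returned count to choose a different resource can be sketched as follows. This is only an illustration under stated assumptions: the resources are taken to be interchangeable and indexed (for example, several equivalent work queues), the offset-by-count policy is hypothetical, and `pick_resource` and `try_acquire` are invented names standing in for an attempt at a critical section that yields the arbiter's count.

```python
def pick_resource(start, num_resources, try_acquire, max_attempts=8):
    """Return the index of a resource acquired without interference, or None.

    try_acquire(index) is a hypothetical callable that attempts the critical
    section on the given resource and returns the count value: zero for a
    pass, nonzero for a failure.
    """
    index = start
    for _ in range(max_attempts):
        count = try_acquire(index)
        if count == 0:
            return index  # pass: no interference on this resource
        # Failure: step ahead by the number of contenders, so that requesters
        # that received different counts spread over different resources.
        index = (index + count) % num_resources
    return None
```

Because each failing requester receives a different count, requesters that collide on one resource tend to diverge on their next attempt instead of colliding again.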

Accordingly, as shown in FIG. 2, synchronization arbiter 230 includes a storage 232 including a number of entries, control logic 234, and compare unit 231. Each of the entries may store one or more physical addresses of requests currently being operated on. In one embodiment, each entry may store up to eight physical addresses that are transported as a single 64-byte request. In addition, the synchronization arbiter entry includes the count value 233, which corresponds to all the addresses in the entry. As described above, the count value may be indicative of the number of requesters (i.e., interferers) that are contending for any of the addresses in a critical section. When synchronization arbiter 230 receives a set of addresses, compare unit 231 within synchronization arbiter 230 checks for a match between each address in the set and all the addresses in storage 232. If there is no match, control logic 234 may be configured to issue a pass response by returning a passing count value and to store the addresses within storage 232. In one embodiment, the passing count value is zero, although any suitable count value may be used. However, if there is an address match, control logic 234 may increment the count value 233 associated with the set of addresses that includes the matching address, and then return that count value as part of a failure response. It is noted that compare unit 231 may be a compare only structure implemented in a variety of ways, as desired. In addition, in another embodiment, each address stored within storage 232 may be associated with a respective count. As such, the count value may be indicative of the number of requesters (i.e., interferers) that are contending for the respective address in a critical section.
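The pass/fail behavior described above can be modeled in software with a minimal sketch. It is a behavioral illustration, not the hardware design: the class and method names are hypothetical, stored counts are assumed to start at zero, only the first matching entry is updated when several match, and an entry is assumed to be removed when its holder sends a completion message.

```python
class SyncArbiter:
    """Behavioral model of storage 232, compare unit 231 and control logic 234."""

    PASS_COUNT = 0           # passing count value (zero in one embodiment)
    MAX_ADDRS_PER_ENTRY = 8  # eight addresses carried in one 64-byte request

    def __init__(self):
        # Each entry holds one set of addresses plus its count value 233.
        self.entries = []

    def request(self, addrs):
        """Return PASS_COUNT on a pass, or the incremented count on a failure."""
        addrs = frozenset(addrs)
        if not 0 < len(addrs) <= self.MAX_ADDRS_PER_ENTRY:
            raise ValueError("a request carries one to eight addresses")
        for entry in self.entries:
            if entry["addrs"] & addrs:   # any address matches
                entry["count"] += 1      # one more interferer
                return entry["count"]    # failure response
        # No match: record the set and pass the request.
        self.entries.append({"addrs": addrs, "count": 0})
        return self.PASS_COUNT

    def complete(self, addrs):
        """Assumed handling of a completion message: drop the holder's entry."""
        addrs = frozenset(addrs)
        self.entries = [e for e in self.entries if e["addrs"] != addrs]
```

In this model the first requester of a set passes, and each later requester that overlaps the set receives 1, 2, 3, and so on, which gives every interferer a distinct number.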

In the illustrated embodiment, bus interface unit (BIU) 220 includes a count compare circuit 221, a locked line buffer (LLB) 222, and a predictor 223. BIU 220 may also include various other circuits for transmitting and receiving transactions from the various components to which it is connected; however, these have been omitted for clarity. As such, BIU 220 may be configured to transmit a set of addresses associated with a critical section from LLB 222 to synchronization arbiter 230 in response to the execution of an ACQUIRE instruction. In addition, count compare circuit 221 may be configured to compare the count value returned by synchronization arbiter 230 to check if the count is a passing count value (e.g., zero) or a failing count value. It is noted that LLB 222 may be implemented using any type of storage structure. For example, it may be part of an existing memory address buffer (MAB) or separate, as desired.

As described above, if processor core 18 is operating in the deterministic synchronization mode, addresses associated with a critical section may be marked during instruction decode by using the LOCK prefix. More particularly, memory references that explicitly participate in advanced synchronization code sequences are annotated by using the LOCK prefix with an appropriate MOV instruction. LOCKed Load instructions may have the following form:

    LOCK MOVx reg,[B+I*s+DISP]

More particularly, a regular x86 memory read instruction is made special by attaching a LOCK prefix. This causes the BIU 220 to gather the associated marked physical address into the LLB 222 as the address passes through the L1 cache (and TLB 206). In addition, memory access strength is reduced to access the line (in the case of a cache miss) without write permission (ReadS, not ReadM or Read). The Load instruction may not be retired out of LS2 until the ACQUIRE instruction returns from the synchronization arbiter 230.

While the request from BIU 220 (to synchronization arbiter 230) is awaiting a response, the LLB 222 watches for Probes with INValidate semantics, and if one (or more) occurs, the ACQUIRE instruction will be made to fail, even if synchronization arbiter 230 returns a success. The LOCK prefix does not cause any particular locking of the cache or bus, but simply provides a convenient marker to be added to memory based MOV instructions. As such, LOCKed MOV to register instructions (which may be otherwise referred to as LOCKed Loads) may be processed normally down the data cache pipeline.

Accordingly, during address translation each linear address may be stored within the linear address portion of LS2 209. The corresponding physical addresses may be stored in TLB 206 and within physical LS2 210, while the corresponding data may be stored within data cache 207 and data LS2 211. Marker logic 208 may detect the LOCK prefix marker generated during decode and generate an additional marker bit 213, thereby marking each such address as a participant in a critical section. Any LOCKed Load that takes a miss in the data cache may have its cache line data fetched through the memory hierarchy with Read-to-Share access semantics; however, write permission to that specified memory resource is checked.

As described above, if processor core 18 is operating in a deterministic synchronization mode, addresses associated with a critical section may be marked during instruction decode by using the LOCK prefix. More particularly, memory prefetch references that explicitly participate in advanced synchronization code sequences are annotated by using the LOCK prefix with an appropriate PREFETCHW instruction. These types of LOCKed Load instructions may have the following form:

    LOCK PREFETCHW [B+I*s+DISP]

Thus, a regular memory PREFETCHW instruction is made special by attaching a LOCK prefix. This causes the BIU 220 to gather the associated marked physical address into the LLB 222 as the address passes through the L1 cache (and TLB 206). In addition, memory access strength is reduced to avoid an actual DRAM access for the line. The PREFETCHW instruction may not be retired out of LS2 until the ACQUIRE instruction returns from synchronization arbiter 230. These instructions may be used to touch cache lines that participate in the critical section and that need data (e.g., a pointer) in order to touch other data also needed in the critical section. At the conclusion of the specification phase, an ACQUIRE instruction is used to notify BIU 220 that all memory reference addresses for the critical section are stored in LLB 222.

The ACQUIRE instruction may have the form

    ACQUIRE reg, imm8

The ACQUIRE instruction checks that the number of LOCKed memory reference instructions is equal to the immediate value in the ACQUIRE instruction. If this check fails, the ACQUIRE instruction terminates with a failure code; otherwise, the ACQUIRE instruction causes BIU 220 to send all addresses stored within LLB 222 to the synchronization arbiter 230. This instruction ‘looks’ like a memory reference instruction on the data path so that the count value returned from the synchronization arbiter 230 can be used to confirm (or deny) that all the lines can be accessed without interference. No address is necessary for this ‘load’ instruction because there can be only one synchronization arbiter 230 per virtual machine or per system. The register specified in the ACQUIRE instruction is the destination register of processor core 18.

In one embodiment, the semantics of a LOCKed Load operation may include monitoring the location for a PROBE. If a PROBE is detected for a location, the LS1 or LS2 queue may return a failure status without waiting for the read to complete. A general-purpose fault (#GP) may be generated if the number of LOCKed loads exceeds a micro-architectural limit. If an ACQUIRE instruction fails, the count of LOCKed loads will be reset to zero. If the address is not to a Write Back memory type, the ACQUIRE instruction can be made to fail (when subsequently encountered).
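The marking and count-check behavior described above can be sketched in software. The following is a minimal illustrative model, not the hardware itself: the class and method names, the capacity of eight entries, and the use of `None` as the local failure code are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

LINE_BYTES = 64      # cache-line size named in the claims
LLB_CAPACITY = 8     # assumed micro-architectural limit (hypothetical value)


class GeneralProtectionFault(Exception):
    """Stands in for the #GP raised when too many LOCKed references are marked."""


@dataclass
class LockedLineBuffer:
    """Toy model of LLB 222: gathers marked physical line addresses."""
    lines: List[int] = field(default_factory=list)

    def mark(self, phys_addr: int) -> None:
        """Model a LOCKed MOV or LOCKed PREFETCHW passing through the L1 cache."""
        if len(self.lines) >= LLB_CAPACITY:
            raise GeneralProtectionFault("too many LOCKed references")
        # Only the cache-line address is kept; the low offset bits are dropped.
        self.lines.append(phys_addr & ~(LINE_BYTES - 1))

    def acquire(self, imm8: int,
                send_to_arbiter: Callable[[Tuple[int, ...]], int]
                ) -> Tuple[bool, Optional[int]]:
        """Model ACQUIRE reg, imm8. Returns (failed, count)."""
        if len(self.lines) != imm8:
            # Count mismatch: fail locally without contacting the arbiter.
            # Returning None as the count is an assumption of this sketch.
            self.lines.clear()
            return True, None
        count = send_to_arbiter(tuple(self.lines))
        failed = count != 0
        if failed:
            # A failed ACQUIRE resets the count of LOCKed loads to zero.
            self.lines.clear()
        return failed, count
```

In this sketch `send_to_arbiter` is any callable that accepts the gathered addresses and returns a count, so the buffer can be exercised without a model of synchronization arbiter 230.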

It is expected that some critical sections may include a number of arithmetic and control flow decisions to compute what data modifications may be appropriate (if any). However, software should arrange that these types of instructions never cause an actual exception. In one embodiment, arithmetic and memory reference instructions may be processed in either the SSE registers (XMM), or in the general-purpose registers (e.g., EAX), or in the MMX or x87 registers.

As described above, synchronization arbiter 230 may either pass the request en masse or fail the request en masse. If synchronization arbiter 230 fails the request, the response back to BIU 220 may be referred to as a “synchronization arbiter Fail-to-ACQUIRE” with the zero bit set (e.g., RFLAGS.ZF). As described above, the response returned by synchronization arbiter 230 may include the count value 233, which may be indicative of the number of interferers. Software may use this count to reduce future interference as described above. The count value 233 from the synchronization arbiter 230 may be delivered to a general-purpose register (not shown) within processor core 18 and may also be used to set condition codes. If the synchronization arbiter 230 passes the request, the response back to BIU 220 may include a pass count value (e.g., zero).

In one embodiment, if the synchronization arbiter address storage 232 is full, the request may be returned with a negative count value such as minus one (−1), for example. This may provide software running on the processor core a means to see an overload in the system and to enable that software to stop making requests to synchronization arbiter 230 for a while. For example, the software may schedule something else or it may simply waste some time before retrying the synchronization attempt.
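As an illustration of how software might interpret the three kinds of count value (zero, positive, negative), the sketch below classifies a returned count. The `Action` names and the `choose_slot` policy are hypothetical; the text only says that software may use the count to reduce future interference, not how.

```python
from enum import Enum


class Action(Enum):
    ENTER_ATOMIC_PHASE = "enter_atomic_phase"   # count == 0: no interferers
    RETRY_ELSEWHERE = "retry_elsewhere"         # count > 0: contention seen
    BACK_OFF = "back_off"                       # count < 0: arbiter overloaded


def classify_count(count: int) -> Action:
    """Map the count delivered by ACQUIRE to a software decision."""
    if count == 0:
        return Action.ENTER_ATOMIC_PHASE
    if count < 0:
        return Action.BACK_OFF
    return Action.RETRY_ELSEWHERE


def choose_slot(count: int, n_slots: int) -> int:
    """Hypothetical policy: use the interferer count as an offset into
    n_slots interchangeable resources, so that contenders that saw
    different counts tend to pick different resources on the next attempt."""
    if n_slots <= 0:
        raise ValueError("n_slots must be positive")
    return count % n_slots
```

The modulo policy is only one possible choice; any scheme that spreads contenders across distinct sets of addresses would serve the same purpose.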

If the count is zero (meaning there are no interferers observed by synchronization arbiter 230), processor core 18 may execute the instructions in the atomic phase and manipulate the data in the cache lines as desired. When the data manipulation is complete, a RELEASE instruction is executed signifying the end of the critical section. In one embodiment, the RELEASE instruction enables all of the modified data to become visible substantially simultaneously by sending the RELEASE message to synchronization arbiter 230, thereby releasing the associated cache lines back to the system.

In one embodiment, the advanced synchronization facility supports two kinds of failures, a “Fail-to-ACQUIRE” and a “Fail-to-REQUESTOR”. The Fail-to-ACQUIRE failure causes the ACQUIRE instruction to complete with the zero bit set (e.g., RFLAGS.ZF) so that the subsequent conditional jump instruction can redirect control flow away from damage-inducing instructions in the atomic phase. The synchronization arbiter Fail-to-ACQUIRE with the zero bit set (e.g., RFLAGS.ZF) is one type of Fail-to-ACQUIRE failure. A processor Fail-to-ACQUIRE is another type. In one embodiment, during execution of critical sections, processor cores may communicate by observing memory transactions. These observations may be made visible at the ACQUIRE instruction of an executing processor core. More particularly, during the time between the start of collecting of the addresses necessary for a critical section and the response of synchronization arbiter 230, processor core 18 monitors all of those addresses for coherent invalidation probes (e.g., Probe with INValidate). If any of the lines are invalidated, the response from synchronization arbiter 230 may be ignored and the ACQUIRE instruction may be made to fail with the zero bit set (e.g., RFLAGS.ZF).
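The two sources of a Fail-to-ACQUIRE can be illustrated with a small model of the monitoring window. This is a sketch under the assumption that a single Boolean stands in for RFLAGS.ZF; the class name `AcquireWindow` and its methods are invented for the example.

```python
class AcquireWindow:
    """Tracks invalidating probes between the start of address collection
    and the arrival of the arbiter's response."""

    def __init__(self, watched_lines):
        self.watched = set(watched_lines)
        self.invalidated = False

    def probe_invalidate(self, line_addr: int) -> None:
        """Record a Probe with INValidate if it hits a watched line."""
        if line_addr in self.watched:
            self.invalidated = True

    def complete(self, arbiter_count: int) -> bool:
        """Return the modeled zero bit: True means Fail-to-ACQUIRE."""
        if self.invalidated:
            # Processor Fail-to-ACQUIRE: the arbiter's response is ignored,
            # even if it reported a pass.
            return True
        # Synchronization arbiter Fail-to-ACQUIRE: any non-zero count.
        return arbiter_count != 0
```

A probe to an unrelated line leaves the window untouched, so only interference on the collected addresses overrides a passing response.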

The Fail-to-REQUESTOR failure may be sent as a PROBE response if there is a cache hit on a line that has been checked for interference and passed by synchronization arbiter 230. A Fail-to-REQUESTOR response causes the requesting processor to Fail-to-ACQUIRE if it is currently processing an advanced synchronization facility critical section, or it will cause the requesting processor's BIU to reissue that memory request if it is not processing the critical section. As such, BIU 220 may be configured to cause a Fail-to-ACQUIRE in response to receiving a Probe with INValidate prior to obtaining a pass notification from synchronization arbiter 230.

Once the addresses of the critical section have been acquired, a processor core 18 that has had its addresses passed by synchronization arbiter 230 may obtain each cache line for exclusive access (e.g., write permission) as memory reference instructions are processed in the atomic phase. After a passed cache line arrives, processor core 18 may hold onto that cache line and prevent other processor cores from stealing the line by responding to coherent invalidation probes with Fail-to-REQUESTOR responses. It is noted that Fail-to-REQUESTOR may also be referred to as a negative-acknowledgement (NAK).

As described above, when a processor receives a Fail-to-REQUESTOR and it is currently participating in an advanced synchronization instruction sequence, that instruction sequence will be caused to fail at the ACQUIRE instruction. In this case, the subsequent conditional jump is taken and the damage-inducing part of the memory reference instructions in the critical section may be avoided. However, when a processor receives a Fail-to-REQUESTOR and is not participating in an advanced synchronization instruction sequence, the requesting processor's BIU may just re-request the original memory transaction. Thus, the elapsed time between the sending of the Fail-to-REQUESTOR and the subsequent arrival of the next coherent invalidation probe at the passed critical section enables forward progress to be guaranteed for the processor whose request was passed by the synchronization arbiter. The guarantee of forward progress enables the advanced synchronization facility to be more efficient under contention than currently existing synchronization mechanisms. Accordingly, sooner or later, the critical section and the interfering memory reference may both be performed (e.g., no live-lock, nor dead-lock).
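The requester-side handling of a Fail-to-REQUESTOR (NAK) can be sketched as follows. The enum, the function names, and the simplification that the holder NAKs a fixed number of probes before releasing the line are assumptions made for illustration only.

```python
from enum import Enum


class RequesterAction(Enum):
    FAIL_ACQUIRE = "fail_acquire"   # inside a critical section: ACQUIRE fails
    RE_REQUEST = "re_request"       # otherwise: BIU reissues the transaction


def on_fail_to_requestor(in_critical_section: bool) -> RequesterAction:
    """Decide what a requester does when its probe is answered with a NAK."""
    if in_critical_section:
        return RequesterAction.FAIL_ACQUIRE
    return RequesterAction.RE_REQUEST


def requests_until_granted(holder_naks: int) -> int:
    """Count the requests a processor outside any critical section issues
    when the line holder NAKs the first `holder_naks` probes and then
    releases the line. The loop always terminates, which mirrors the
    forward-progress argument: the holder finishes, then the requester wins."""
    attempts = 0
    naks_left = holder_naks
    while True:
        attempts += 1
        if naks_left == 0:
            return attempts
        naks_left -= 1
        # A NAK outside a critical section simply triggers another request.
        if on_fail_to_requestor(False) is not RequesterAction.RE_REQUEST:
            raise RuntimeError("unexpected action for a non-critical requester")
```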

As mentioned above, the performance of a processor participating in the Advanced Synchronization Facility may be optimized by using a critical section predictor 223. Initially predictor 223 may be set up to predict that no interference is expected during execution of a critical section. In this mode, processor core 18 may not actually use the synchronization arbiter 230. Instead processor core 18 may record the LOCKed memory references and may check these against Coherent Invalidation PROBEs to detect interference. If the end of the critical section is reached before any interference is detected, no interested third party has seen the activity of the critical section and it has been performed as if it were executed atomically. This property enables the Advanced Synchronization Facility to be processor-cycle competitive with currently existing synchronization mechanisms when no contention is observed.

More particularly, when interference is detected, processor core 18 may create a failure status for the ACQUIRE instruction and the subsequent conditional branch redirects the flow of control out of the critical section, and resets the predictor to predict deterministic mode. When the next critical section is detected, the decoder will then predict interference might happen, and will process the critical section using the synchronization arbiter 230 (if enabled).
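The predictor behavior can be sketched as a one-bit state machine. This is a simplified model: the class and method names are hypothetical, and because the text does not describe when (or whether) the predictor returns to the optimistic setting, that transition is left out.

```python
class CriticalSectionPredictor:
    """One-bit model of critical section predictor 223."""

    def __init__(self) -> None:
        # Initially predict that no interference will occur.
        self.expect_interference = False

    def use_arbiter(self, arbiter_enabled: bool = True) -> bool:
        """True when the next critical section should go through the
        synchronization arbiter (deterministic mode)."""
        return self.expect_interference and arbiter_enabled

    def record_interference(self) -> None:
        """Called when a coherent invalidation probe hits a LOCKed line:
        the current ACQUIRE fails and later sections use the arbiter."""
        self.expect_interference = True


def run_section(predictor: CriticalSectionPredictor, probe_hit: bool) -> str:
    """Return how one critical section was handled in this toy model."""
    if predictor.use_arbiter():
        return "deterministic"
    if probe_hit:
        predictor.record_interference()
        return "optimistic_failed"
    return "optimistic_ok"
```

Running `run_section` three times with probe hits of `False`, `True`, `False` would yield an optimistic success, an optimistic failure, and then a deterministic pass through the arbiter.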

In one embodiment, the Advanced Synchronization facility may operate on misaligned data items as long as these items do not span cache lines that are not participating in the actual critical section. Software is free to have synchronization items span cache line boundaries as long as all cache lines so touched are recognized as part of the critical section entry. When a data item spans a cache line into another cache line that was not part of the synchronization communication, the processor neither detects the failure of atomicity nor signals the lack of atomicity.
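Because the processor does not detect an item that spills into a non-participating line, software may want to check this itself. The helper below is an illustrative sketch, assuming the 64-byte cache line named in the claims; the function names are invented for the example.

```python
LINE_BYTES = 64  # cache-line size named in the claims


def lines_touched(addr: int, size: int) -> list:
    """Return the base addresses of every cache line that an access of
    `size` bytes starting at `addr` touches."""
    if size <= 0:
        raise ValueError("size must be positive")
    first = addr // LINE_BYTES
    last = (addr + size - 1) // LINE_BYTES
    return [n * LINE_BYTES for n in range(first, last + 1)]


def covered_by_entry(addr: int, size: int, locked_lines) -> bool:
    """True if every line touched by the access was named at critical
    section entry (i.e., is among the LOCKed lines)."""
    return set(lines_touched(addr, size)) <= set(locked_lines)
```

For example, an 8-byte item at address 0x103C occupies bytes 0x103C through 0x1043 and therefore touches the lines at 0x1000 and 0x1040; both would need to be marked before the ACQUIRE.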

Further, access to critical section data may be dependent upon the presence of that data in main memory. All of the lines necessary for the critical section are touched before ENTRY into the critical section, and any access rights issues or page-faulting issues may be detected when the LOCKed Load or LOCKed PREFETCHW instructions execute prior to entering the critical section. When any of the lead-in addresses take a fault, the subsequent ACQUIRE instruction is made to fail. After entry to the critical section, if any instruction causes an exception, the processor will cause a failure at the ACQUIRE instruction, and the subsequent conditional jump redirects control away from the critical section.

In one embodiment, if the decoder of processor core 18 must take an interrupt, it may arrange that the ACQUIRE instruction will fail with the zero bit set (e.g., RFLAGS.ZF), and take the interrupt at the ACQUIRE instruction.

It is noted that in embodiments in which synchronization arbiter 230 is connected within a North Bridge implementation within the HyperTransport™ fabric, synchronization arbiter 230 may be assigned a predetermined and/or reserved node ID that no other component may have. This assignment may be made at boot time by the BIOS, for example. In addition, in the above embodiments, the count value may be returned as a 64-bit value, although other values are contemplated.

FIG. 3 is a flow diagram describing the operation of the embodiments of the computer system shown in FIG. 1 and FIG. 2. Referring collectively to FIG. 1 through FIG. 3, and beginning in block 405, addresses of cache lines that are currently being operated on or accessed as part of a critical section are maintained in a list (e.g., within LLB 222). For example, synchronization arbiter 230 may store the addresses corresponding to a critical section, as a set, within an entry of address storage 232. In one embodiment, each entry of address storage 232 may also store a count value that is associated with the whole set of addresses stored therein (block 410). As described above, the count value may be indicative of the number of contenders (i.e., interferers) for any of the addresses in the set. In another embodiment, synchronization arbiter 230 may store a number of count values within each entry, such that each address in the entry has an associated count value.

When a processor or processor core that is implementing the advanced synchronization facility requests exclusive access to one or more cache lines, the request comes in the form of a critical code section. For example, as described above, to ensure completion of the instructions in an atomic manner (as viewed by all outside observers) a critical section may include the use of LOCKed MOV instructions, followed by an ACQUIRE instruction and a RELEASE instruction (block 415). Accordingly, the set of addresses that are requested are checked for interference. In one embodiment, the set of addresses is compared to all of the addresses within address storage 232 (block 420). In the embodiments described above, the LOCKed MOV instructions cause the addresses to be marked. The marker causes BIU 220 to store each marked address in LLB 222. The ACQUIRE instruction causes BIU 220 to send the entire set of addresses in LLB 222 to synchronization arbiter 230 in the form of an unCacheable write that carries 64 bytes of physical address data. Synchronization arbiter 230 compares the set of addresses to all the addresses in the storage 232.
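A 64-byte payload is consistent with a set of up to eight physical addresses (as recited in the claims) at eight bytes each, since 8 × 8 = 64. The sketch below packs such a payload. The field layout (little-endian 64-bit words, unused slots zero-filled) is purely an assumption for illustration; the text does not define the wire format.

```python
import struct

MAX_ADDRS = 8          # "up to eight physical addresses" per set (claims)
PAYLOAD_BYTES = 64     # size of the unCacheable write's address data


def pack_acquire_payload(addrs) -> bytes:
    """Pack 1..8 physical addresses into a 64-byte payload.
    Assumed layout: eight little-endian unsigned 64-bit words,
    with unused slots filled with zero."""
    addrs = list(addrs)
    if not 1 <= len(addrs) <= MAX_ADDRS:
        raise ValueError("expected between 1 and 8 addresses")
    padded = addrs + [0] * (MAX_ADDRS - len(addrs))
    return struct.pack("<8Q", *padded)


def unpack_acquire_payload(payload: bytes) -> list:
    """Inverse of pack_acquire_payload. Zero words are treated as empty
    slots, so this toy format cannot carry physical address 0."""
    if len(payload) != PAYLOAD_BYTES:
        raise ValueError("payload must be exactly 64 bytes")
    return [a for a in struct.unpack("<8Q", payload) if a != 0]
```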

If there is a match on any address (block 425), the count value associated with the matching address is incremented (block 455) and the new count value is returned to BIU 220 as a part of a failure response to the unCacheable write (block 460) that carries 64 bits of response data. In addition, synchronization arbiter 230 discards the set of addresses upon failure. BIU 220 sends the failure count value to the register of the requesting processor/core, which may also set condition code flags. As a result, the requesting processor/core may use the count value to select another set of memory resources in subsequent operations (block 465) and avoid interference on its subsequent synchronization attempt. Operation proceeds as described above in block 415.

Referring back to block 425, if there is no matching address in storage 232, synchronization arbiter 230 may return a passing count value (e.g., zero) to BIU 220 (block 430). In addition, synchronization arbiter 230 may store the set of addresses in an entry of storage 232 (block 435). BIU 220 may send the passing count value to the requesting processor/core register specified in the ACQUIRE instruction. As such, the requesting processor/core may manipulate or otherwise operate on the data at the requested addresses (block 440). If the operation is not complete (block 445), BIU 220 defers sending a completion message to synchronization arbiter 230. When the operation in the critical section is complete, such as when the RELEASE instruction is executed, BIU 220 may send a completion message to synchronization arbiter 230. Upon receiving the completion message, synchronization arbiter 230 may flush the corresponding addresses from storage 232, thereby releasing those addresses back to the system (block 450) for use by another processor/core. In addition, load/store unit 270 updates the data cache for all instructions in that critical section that retired.
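The arbiter side of the flow in FIG. 3 (blocks 420 through 460) can be sketched as a small software model. This is an illustration under several assumptions that the text leaves open: entries are keyed by a requester identifier, a newly stored entry starts with a count of zero, only the first matching entry is incremented, and the match check is performed before the storage-full check.

```python
PASS_COUNT = 0    # passing count value
FULL_COUNT = -1   # returned when address storage is full


class SynchronizationArbiter:
    """Toy model of synchronization arbiter 230 with per-set counts."""

    def __init__(self, n_entries: int = 4) -> None:
        self.n_entries = n_entries
        # requester id -> [frozenset of line addresses, count value]
        self.entries = {}

    def acquire(self, requester, addresses) -> int:
        """Compare the requested set against every stored address."""
        requested = frozenset(addresses)
        for entry in self.entries.values():
            if entry[0] & requested:       # any address matches (block 425)
                entry[1] += 1              # increment the count (block 455)
                return entry[1]            # fail; request set is discarded
        if len(self.entries) >= self.n_entries:
            return FULL_COUNT              # overload indication
        self.entries[requester] = [requested, 0]   # store set (block 435)
        return PASS_COUNT                  # pass (block 430)

    def release(self, requester) -> None:
        """Completion message: flush the requester's set (block 450)."""
        self.entries.pop(requester, None)
```

With this model, a second requester that overlaps a stored set receives 1, a third receives 2, and the addresses become available again only after the holder's `release`.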

As described above, if a coherency invalidation probe hits on an address in the critical section during processing of the critical section, the response to that probe may be dependent upon the state of processing of the critical section (i.e., whether or not the cache lines have been acquired). FIG. 4 is a flow diagram describing the operation of the embodiments of FIG. 1 and FIG. 2 when a coherency invalidation probe is received.

Referring collectively to FIG. 1 through FIG. 4 and beginning in block 505 of FIG. 4, a Probe is received and hits on a critical section address in load/store unit 270. If the requested lines have been successfully acquired (block 510) (e.g., a coherency invalidation probe is received after synchronization arbiter 230 has provided a pass count value and stored the set of addresses within storage 232), BIU 220 may send a Fail-to-REQUESTOR response as a response to the probe (block 515). At the requesting processor core, this Fail-to-REQUESTOR response should cause a failure of the ACQUIRE instruction if the processor core was operating in a critical section, or a retry of the addresses if not.

Referring back to block 510, if the requested lines have not been acquired, the processor core may ignore any count value received from synchronization arbiter 230 (block 520). Load/store unit 270 may notify instruction dispatch and control unit 257 that there is a probe hit (e.g., Prb hit signal), and thus there is a Fail-to-ACQUIRE. As such, the ACQUIRE instruction is made to fail, as described above. To an outside observer, the ACQUIRE instruction simply failed.
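The decision at block 510 of FIG. 4 can be summarized in a short sketch from the point of view of the core that receives the probe. The names are hypothetical, and the `NORMAL` response (answering the probe in the ordinary coherent way when the lines are not yet held) is an assumption, since the text does not spell out that response.

```python
from enum import Enum


class ProbeResponse(Enum):
    FAIL_TO_REQUESTOR = "fail_to_requestor"   # NAK, block 515
    NORMAL = "normal"                         # assumed ordinary probe reply


def on_invalidating_probe(hits_critical_line: bool, lines_acquired: bool):
    """Return (response, acquire_must_fail) for a received probe.

    lines_acquired is True once the arbiter has returned a pass count
    and stored the set of addresses (the 'yes' branch of block 510)."""
    if not hits_critical_line:
        return ProbeResponse.NORMAL, False
    if lines_acquired:
        # Hold the line and NAK the prober; the critical section continues.
        return ProbeResponse.FAIL_TO_REQUESTOR, False
    # Not yet acquired: ignore any arbiter count and fail the ACQUIRE
    # (block 520).
    return ProbeResponse.NORMAL, True
```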

It is noted that although the computer system 100 described above includes processing nodes that include one or more processor cores, it is contemplated that in other embodiments, the advanced synchronization facility and associated hardware may be implemented using stand-alone processors or a combination of processing nodes and stand-alone processors, as desired. In such embodiments, each stand-alone processor may include all or part of the above described hardware and may be capable of executing the instructions that are part of the advanced synchronization facility. As such, the terms processor and processor core may be used somewhat synonymously, except when specifically enumerated to be different.

Code and/or data that implements the functionality described in the preceding sections may also be provided on computer accessible/readable medium. Generally speaking, a computer accessible/readable medium may include any media accessible by a computer during use to provide instructions and/or data to the computer. For example, a computer accessible medium may include storage media such as magnetic or optical media, e.g., disk (fixed or removable), CD-ROM, or DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, volatile or non-volatile memory media such as RAM (e.g., synchronous dynamic RAM (SDRAM), Rambus DRAM (RDRAM), static RAM (SRAM), etc.), ROM, Flash memory, non-volatile memory (e.g., Flash memory) accessible via a peripheral interface such as the Universal Serial Bus (USB) interface, etc., as well as media accessible via transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link.

Although the embodiments above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

1. A synchronization arbiter for use in a computer system including one or more processors each configured to request exclusive access to a given memory resource, wherein the request includes one or more addresses associated with the given memory resource, the synchronization arbiter comprising: an address storage configured to store a plurality of sets of addresses, wherein each address of the plurality of sets of addresses corresponds to a respective memory resource to which a requester has acquired exclusive access; wherein the address storage is further configured to store a plurality of count values each associated with a respective set of addresses of the plurality of sets of addresses, wherein each count value is indicative of a number of requestors contending for any address in each respective set of addresses; a compare unit coupled to the address storage and configured to compare each of the one or more addresses in the request to each address of the plurality of sets of addresses stored in the address storage; and control logic coupled to the compare unit and configured to return to the requester, the count value associated with a matching address in response to any address of the one or more addresses matching any address in the plurality of sets of addresses.
2. The synchronization arbiter as recited in claim 1, wherein the control logic is further configured to return to the requestor, a predetermined count value in response to no address of the one or more addresses matching any address in the plurality of sets of addresses.
3. The synchronization arbiter as recited in claim 2, wherein the predetermined count value comprises a pass count value of zero.
4. The synchronization arbiter as recited in claim 1, wherein the control logic is further configured to store the one or more addresses within the address storage in response to no address of the one or more addresses matching any address in the plurality of sets of addresses.
5. The synchronization arbiter as recited in claim 1, wherein the address storage comprises a plurality of entries, wherein each entry is configured to store one set of the plurality of sets of addresses and the associated count value.
6. The synchronization arbiter as recited in claim 1, wherein each address corresponds to a physical address of a 64-byte cache line.
7. The synchronization arbiter as recited in claim 1, wherein each set of addresses of the plurality of sets of addresses comprises up to eight physical addresses.
8. The synchronization arbiter as recited in claim 1, wherein the control logic is further configured to remove a set of addresses from the address storage in response to receiving a notification of completion of operations on corresponding data in the given memory resource.
9. The synchronization arbiter as recited in claim 1, wherein the control logic is further configured to increase the count value associated with the set of addresses including the matching address prior to returning the count value.
10. The synchronization arbiter as recited in claim 1, further comprising one or more communications interfaces for connection to the one or more processors via one or more communications links.
11. The synchronization arbiter as recited in claim 10, wherein the one or more communications links comprise pairs of unidirectional packet-based links.
12. A computer system comprising: one or more processors coupled together and to one or more memories, wherein each of the processors is configured to request exclusive access to a given memory, wherein the request includes one or more addresses associated with the given memory; and a synchronization arbiter coupled to each of the one or more processors, wherein the synchronization arbiter includes: an address storage configured to store a plurality of sets of addresses, wherein each address of the plurality of sets of addresses corresponds to a respective memory to which a requester has acquired exclusive access; wherein the address storage is further configured to store a plurality of count values each associated with a respective set of addresses of the plurality of sets of addresses, wherein each count value is indicative of a number of requesting processors contending for any address in each respective set of addresses; a compare unit coupled to the address storage and configured to compare each of the one or more addresses in the request to each address of the plurality of sets of addresses stored in the address storage; and control logic coupled to the compare unit and configured to return to the requesting processor, the count value associated with a matching address in response to any address of the one or more addresses matching any address in the plurality of sets of addresses.
13. The computer system as recited in claim 12, wherein the control logic is further configured to return to the requestor, a predetermined count value in response to no address of the one or more addresses matching any address in the plurality of sets of addresses.
14. The computer system as recited in claim 13, wherein the predetermined count value comprises a pass count value of zero.
15. The computer system as recited in claim 12, wherein the control logic is further configured to store the one or more addresses within the address storage in response to no address of the one or more addresses matching any address in the plurality of sets of addresses.
16. The computer system as recited in claim 12, wherein the address storage comprises a plurality of entries, wherein each entry is configured to store one set of the plurality of sets of addresses and the associated count value.
17. The computer system as recited in claim 12, wherein each address corresponds to a physical address of a 64-byte cache line.
18. The computer system as recited in claim 12, wherein each set of addresses of the plurality of sets of addresses comprises up to eight physical addresses.
19. The computer system as recited in claim 12, wherein the control logic is further configured to remove a set of addresses from the address storage in response to receiving a notification of completion of operations on corresponding data in the given memory resource.
20. The computer system as recited in claim 12, wherein the control logic is further configured to increase the count value associated with the set of addresses including the matching address prior to returning the count value.
21. The computer system as recited in claim 12, wherein the one or more processors and the synchronization arbiter are interconnected via a plurality of communications links each comprising a pair of unidirectional packet-based links.
22. The computer system as recited in claim 12, wherein each of the one or more processors is further configured to use the count value to request exclusive access to a different memory resource including a different set of addresses.
23. A synchronization arbiter for use in a computer system including one or more processors each configured to request exclusive access to a given memory resource, wherein the request includes one or more addresses associated with the given memory resource, the synchronization arbiter comprising: an address storage configured to store a plurality of sets of addresses, wherein each address of the plurality of sets of addresses corresponds to a respective memory resource to which a requestor has acquired exclusive access; wherein the address storage is further configured to store a plurality of count values, each count value associated with a respective address of each set of the plurality of sets of addresses, wherein each count value is indicative of a number of requestors contending for the associated respective address; a compare unit coupled to the address storage and configured to compare each of the one or more addresses in the request to each address of the plurality of sets of addresses stored in the address storage; and control logic coupled to the compare unit and configured to return to the requester, the count value associated with a matching address in response to any address of the one or more addresses matching any address in the plurality of sets of addresses.
24. The synchronization arbiter as recited in claim 23, wherein the control logic is further configured to store the one or more addresses, as a set, within the address storage in response to no address of the one or more addresses matching any address in the plurality of sets of addresses.
25. The synchronization arbiter as recited in claim 23, wherein the control logic is further configured to store the one or more addresses using a sequence of store operations within the address storage in response to no address of the one or more addresses matching any address in the plurality of sets of addresses.
26. The synchronization arbiter as recited in claim 23, wherein the control logic is further configured to remove a set of addresses from the address storage in response to receiving a notification of completion of operations on corresponding data in the given memory resource.
27. The synchronization arbiter as recited in claim 23, wherein the control logic is further configured to remove addresses from the address storage in a sequence of operations in response to receiving a notification of completion of operations on corresponding data in the given memory resource.